Abstract. Social bookmarking systems allow users to organise collections
of resources on the Web in a collaborative fashion. The increasing
popularity of these systems as well as first insights into
their emergent semantics have made them relevant to disciplines like 
knowledge extraction and ontology learning. The problem of devis- 
ing methods to measure the semantic relatedness between tags and 
characterizing it semantically is still largely open. Here we analyze 
three measures of tag relatedness: tag co-occurrence, cosine similar- 
ity of co-occurrence distributions, and FolkRank, an adaptation of the 
PageRank algorithm to folksonomies. Each measure is computed on 
tags from a large-scale dataset crawled from the social bookmarking 
system del.icio.us. To provide a semantic grounding of our findings, a 
connection to WordNet (a semantic lexicon for the English language) 
is established by mapping tags into synonym sets of WordNet, and 
applying there well-known metrics of semantic similarity. Our results 
clearly expose different characteristics of the selected measures of 
relatedness, making them applicable to different subtasks of knowl- 
edge extraction such as synonym detection or discovery of concept 
hierarchies. 

1 Introduction 

Social bookmarking systems have become extremely popular in re- 
cent years. Their underlying data structures, known as folksonomies, 
consist of a set of users, a set of free-form keywords (called tags), 
a set of resources, and a set of tag assignments, i. e., a set of 
user/tag/resource triples. As folksonomies are large-scale bodies of 
lightweight annotations provided by humans, they are becoming 
more and more interesting for research communities that focus on 
extracting machine-processable semantic structures from them. The 
structure of folksonomies, however, differs fundamentally from that 
of e.g., natural text or web resources, and sets new challenges for the 
fields of knowledge discovery and ontology learning. Crucial hereby 
are the concepts of similarity and relatedness. Here we will focus 
on similarity and relatedness of tags, because this affords compari- 
son with well-established measures of similarity in existing lexical 
databases. 

Ref. [2] points out that similarity can be considered as a special
case of relatedness. As both similarity and relatedness are semantic
notions, one way of defining them for a folksonomy is to map the tags
to a thesaurus or lexicon like Roget's thesaurus or WordNet [6], and
to measure the relatedness there by means of well-known metrics.
The other option is to define measures of relatedness directly on the
network structure of the folksonomy. There are several obvious possibilities,
and most of them use statistical information about different
types of co-occurrence between tags, resources and users. Another
possibility is to adopt the distributional hypothesis [7, 11], which
states that words found in similar contexts tend to be semantically 
similar. One important reason for using distributional measures in 
folksonomies instead of mapping tags to a thesaurus is the observa- 
tion that the vocabulary of folksonomies includes many community- 
specific terms which did not make it yet into any lexical resource. 

The distributional hypothesis is also at the basis of a number of approaches
to synonym acquisition from text corpora [5]. As in other
ontology learning scenarios, clustering techniques are often applied 
to group similar terms extracted from a corpus, and a core building 
block of such procedure is the metric used to judge term similarity. 
In order to adapt these approaches to folksonomies, several distributional
measures of tag relatedness have been used in theory or implemented
in applications [12, 24]. In most studies, however, the selected
measures of relatedness seem to have been chosen in a rather
ad-hoc fashion. We believe that a deeper insight into the semantic 
properties of relatedness measures is an important prerequisite for 
the design of ontology learning procedures that are capable of suc- 
cessfully harvesting the emergent semantics of a folksonomy. 

In this paper, we consider the following three measures for the relatedness
of tags: the co-occurrence count, the cosine similarity [23]
of co-occurrence distributions, and FolkRank [13], a graph-based
measure that is an adaptation of PageRank [20] to folksonomies. Our
analysis is based on data from a large-scale snapshot of the popular
social bookmarking system del.icio.us. To provide a semantic
grounding of our folksonomy-based measures, we map the tags of
del.icio.us to synsets of WordNet and use the semantic relations of
WordNet to infer corresponding semantic relations in the folksonomy.
In WordNet, we measure the similarity by using both the taxonomic
path length and a similarity measure by Jiang and Conrath [14]
that has been validated through user studies and applications [2]. The
use of taxonomic path lengths, in particular, allows us to inspect the 
edge composition of paths leading from one tag to the correspond- 
ing related tags, and such a characterization proves to be especially 
insightful. 

The paper is organized as follows: In the next section, we discuss
related work. In Section 3 we provide a definition of folksonomy
and describe the del.icio.us data on which our experiments are based.
Section 4 describes the three measures of relatedness that we will
analyze. Section 5 provides first examples and qualitative insights.
The semantic grounding of the measures in WordNet is described in
Section 6. We discuss our results in the context of ontology learning
in Section 7, where we also point to future work.
2 Related Work

One of the first scientific publications about folksonomies is [17],
where several concepts of bottom-up social annotation are introduced.
Refs. [15, 18] introduce a tri-partite graph representation for folksonomies,
where nodes are users, tags and resources. Ref. [9] provides
a first quantitative analysis of del.icio.us.

A considerable number of investigations are motivated by the vision
of "bridging the gap" between the Semantic Web and Web 2.0
by means of ontology-learning procedures based on folksonomy annotations.
Ref. [18] provides a model of semantic-social networks for
extracting lightweight ontologies from del.icio.us. Other approaches
for learning taxonomic relations from tags are [12, 24]. Ref. [10]
presents a generative model for folksonomies and also addresses the
learning of taxonomic relations. Ref. [25] applies statistical methods
to infer global semantics from a folksonomy. The distribution of tag
co-occurrence frequencies has been investigated in [3] and the network
structure of folksonomies was investigated in [4].

After comparing distributional measures on natural text with measures
for semantic relatedness in thesauri like WordNet, [19] concluded
that "distributional measures [...] can easily provide domain-specific
similarity measures for a large number of domains [...]."
Our work presented in this paper indicates that these findings can be 
transferred to folksonomies. 

3 Folksonomy Definition and Data 

In the following we will use the definition of folksonomy provided
in [13]:

Definition. A folksonomy is a tuple $F := (U, T, R, Y)$ where $U$,
$T$, and $R$ are finite sets, whose elements are called users, tags and
resources, respectively, and $Y$ is a ternary relation between them,
i. e., $Y \subseteq U \times T \times R$. A post is a triple $(u, T_{ur}, r)$ with $u \in U$, $r \in R$,
and $T_{ur} := \{t \in T \mid (u, t, r) \in Y\}$.

Users are typically represented by their user ID, tags may be ar- 
bitrary strings, and resources depend on the system and are usually 
represented by a unique ID. 
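As an illustration of the definition, the following minimal Python sketch stores a folksonomy as a set of (user, tag, resource) triples and derives the posts from it. The toy data and the function name `posts` are assumptions made for illustration only; they are not taken from the del.icio.us crawl or from [13].

```python
from collections import defaultdict

def posts(Y):
    """Group tag assignments (u, t, r) into posts (u, T_ur, r), keyed by (u, r)."""
    grouped = defaultdict(set)
    for u, t, r in Y:
        grouped[(u, r)].add(t)
    return dict(grouped)

# Hypothetical tag assignments Y, a subset of U x T x R.
Y = {
    ("alice", "python", "r1"),
    ("alice", "code", "r1"),
    ("bob", "python", "r1"),
    ("bob", "games", "r2"),
}

# The sets U, T and R can be read off the triples.
U = {u for u, _, _ in Y}
T = {t for _, t, _ in Y}
R = {r for _, _, r in Y}

P = posts(Y)
```

Each key of `P` identifies one post; its value is the tag set $T_{ur}$ that the user attached to the resource.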

For our experiments we used data from the social bookmarking 
system del.icio.us, collected in November 2006. As one main fo- 
cus of this work is to characterize tags by their distribution of co- 
occurrence with other tags, we restricted our data to the 10,000 most 
frequent tags of del.icio.us, and to the resources/users that have been 
associated with at least one of those tags. One could argue that tags 
with low frequency have a higher information content in principle 
— but their inherent sparseness of co-occurrence makes them less 
useful for the study of distributional measures. The restricted folksonomy
consists of |U| = 476,378 users, |T| = 10,000 tags,
|R| = 12,660,470 resources, and |Y| = 101,491,722 tag assignments.

4 Measures of Relatedness 

A folksonomy can also be regarded as an undirected tri-partite hypergraph
$G = (V, E)$, where $V = U \cup T \cup R$ is the set of nodes, and
$E = \{\{u, t, r\} \mid (u, t, r) \in Y\}$ is the set of hyper-edges. Alternatively,
the folksonomy hyper-graph can be represented as a three-dimensional
(binary) adjacency matrix. In Formal Concept Analysis [8]
this structure is known as a triadic context [16]. All these
equivalent notions make explicit that folksonomies are special cases
of three-mode data. Since measures for similarity and relatedness are
not yet well developed for three-mode data, we will consider two-
and one-mode views on the data. These two views will be complemented
by a graph-based approach for discovering related tags
(FolkRank) which makes direct use of the three-mode structure.

Co-Occurrence

Given a folksonomy $(U, T, R, Y)$, we define the tag-tag co-occurrence
graph as a weighted, undirected graph, whose set of vertices
is the set $T$ of tags, and where two tags $t_1$ and $t_2$ are connected
by an edge iff there is at least one post $(u, T_{ur}, r)$ with $t_1, t_2 \in T_{ur}$.
The weight of this edge is given by the number of posts that contain
both $t_1$ and $t_2$, i. e.,

$$w(t_1, t_2) := \mathrm{card}\{(u, r) \in U \times R \mid t_1, t_2 \in T_{ur}\}. \qquad (1)$$

Co-occurrence relatedness between tags is given directly by the
edge weights. For a given tag $t \in T$, the tags that are most related to
it are thus all tags $t' \in T$ with $t' \neq t$ such that $w(t, t')$ is maximal.
In the sequel, we will denote the co-occurrence relatedness also by
freq.
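The weight in Eq. (1) can be computed by grouping tag assignments into posts and counting, for every pair of tags, the posts that contain both. The sketch below is one straightforward way to do this in Python; the function names and the toy triples are assumptions for illustration, and no attempt is made to scale to the full dataset.

```python
from collections import Counter, defaultdict
from itertools import combinations

def cooccurrence(Y):
    """Return w[(t1, t2)] = number of posts (u, r) whose tag set contains t1 and t2."""
    posts = defaultdict(set)
    for u, t, r in Y:
        posts[(u, r)].add(t)
    w = Counter()
    for tags in posts.values():
        for t1, t2 in combinations(sorted(tags), 2):
            w[(t1, t2)] += 1  # store both orders so that w is symmetric
            w[(t2, t1)] += 1
    return w

def most_related(w, t, k=10):
    """Tags with the highest co-occurrence weight to t (ties broken alphabetically)."""
    candidates = [(t2, c) for (t1, t2), c in w.items() if t1 == t]
    candidates.sort(key=lambda item: (-item[1], item[0]))
    return [t2 for t2, _ in candidates[:k]]

# Hypothetical tag assignments (user, tag, resource).
Y = [
    ("u1", "java", "r1"), ("u1", "programming", "r1"),
    ("u2", "java", "r1"), ("u2", "programming", "r1"), ("u2", "software", "r1"),
    ("u3", "java", "r2"), ("u3", "software", "r2"),
    ("u4", "java", "r3"), ("u4", "programming", "r3"),
]
w = cooccurrence(Y)
```

With this toy input, `most_related(w, "java", 2)` ranks programming before software, because java shares three posts with the former and two with the latter.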

Cosine Similarity

We introduce a distributional measure of tag relatedness by computing
the cosine similarity of tag-tag co-occurrence distributions.
Specifically, we compute the cosine similarity [23] in the vector
space $\mathbb{R}^T$, where each tag $t$ is represented by a vector $v_t \in \mathbb{R}^T$ with
$v_{tt'} := w(t, t')$ for $t \neq t' \in T$ and $v_{tt} = 0$. The reason for giving
weight zero between a node and itself is that we want two tags to
be considered related when they occur in a similar context, and not
when they occur together.

If two tags $t_1$ and $t_2$ are represented by $v_1, v_2 \in \mathbb{R}^T$, then their
cosine similarity is defined as:

$$\mathrm{cossim}(t_1, t_2) := \cos \angle(v_1, v_2) = \frac{v_1 \cdot v_2}{\|v_1\|_2 \cdot \|v_2\|_2}. \qquad (2)$$
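To make Eq. (2) concrete, the following sketch builds the co-occurrence vector of each tag from a symmetric weight table and evaluates the cosine. The weights are invented for illustration, and the helper names `tag_vector` and `cossim` are assumptions rather than anything prescribed by the paper.

```python
import math

def tag_vector(w, t, tags):
    """Co-occurrence vector of t over all tags, with the diagonal entry set to 0."""
    return [0.0 if other == t else float(w.get((t, other), 0)) for other in tags]

def cossim(w, t1, t2, tags):
    """Cosine of the angle between the co-occurrence vectors of t1 and t2."""
    v1, v2 = tag_vector(w, t1, tags), tag_vector(w, t2, tags)
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm if norm else 0.0

# Hypothetical co-occurrence counts; "game" and "games" never share a post.
pairs = {("game", "fun"): 2, ("game", "flash"): 1,
         ("games", "fun"): 4, ("games", "flash"): 2}
w = {}
for (a, b), count in pairs.items():
    w[(a, b)] = count
    w[(b, a)] = count
tags = ["flash", "fun", "game", "games"]

sim_variants = cossim(w, "game", "games", tags)  # same context, never together
sim_cooccur = cossim(w, "game", "fun", tags)     # together, but different context
```

In this toy setting game and games have proportional context vectors and therefore a cosine of 1 although they never co-occur, whereas game and fun co-occur but have orthogonal contexts. This is the behavior the zero diagonal is meant to support.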

FolkRank 

The PageRank algorithm [1] reflects the idea that a web page is important
if there are many pages linking to it, and if those pages are
important themselves. The same principle was employed for folksonomies
in [13]: a resource which is tagged with important tags by
important users becomes important itself. The same holds, symmetrically,
for tags and users. By modifying the weights for a given tag
in the random surfer vector, FolkRank can compute a ranked list of
relevant tags. Ref. [13] provides a detailed description.
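The idea of biasing the random surfer vector toward one tag can be sketched as a personalized PageRank on an undirected graph whose nodes are users, tags and resources. The code below is a simplified illustration under stated assumptions, not the FolkRank algorithm of [13]: the graph construction (one edge between each pair of elements of a tag assignment), the damping value `d = 0.7`, the preference weights and the fixed iteration count are all choices made here for the example.

```python
from collections import defaultdict

def biased_rank(Y, query_tag, d=0.7, iters=100):
    """Personalized PageRank with the preference vector biased toward query_tag."""
    adj = defaultdict(lambda: defaultdict(float))
    for u, t, r in Y:
        triple = [("user", u), ("tag", t), ("res", r)]
        for a in triple:
            for b in triple:
                if a != b:
                    adj[a][b] += 1.0  # undirected edge, weighted by its count
    nodes = list(adj)
    # Preference ("random surfer") vector: uniform, plus extra mass on the query tag.
    pref = {n: 1.0 for n in nodes}
    pref[("tag", query_tag)] += len(nodes)
    total = sum(pref.values())
    pref = {n: v / total for n, v in pref.items()}
    w = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1.0 - d) * pref[n] for n in nodes}
        for a in nodes:
            out = sum(adj[a].values())
            for b, weight in adj[a].items():
                new[b] += d * w[a] * weight / out
        w = new
    return w

def related_tags(Y, query_tag, k=5):
    """Tags ranked by their weight under the query-biased preference vector."""
    w = biased_rank(Y, query_tag)
    tags = [(n[1], s) for n, s in w.items() if n[0] == "tag" and n[1] != query_tag]
    tags.sort(key=lambda item: (-item[1], item[0]))
    return [t for t, _ in tags[:k]]

# Hypothetical tag assignments forming two unconnected groups of posts.
Y = [("u1", "games", "r1"), ("u1", "fun", "r1"),
     ("u2", "games", "r2"), ("u2", "fun", "r2"),
     ("u3", "java", "r3"), ("u3", "programming", "r3")]
scores = biased_rank(Y, "games")
```

Because the weight injected at games can only flow to nodes connected to it, fun ends up with a higher weight than java or programming in this toy graph.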

5 Qualitative Insights

Using each of the three measures introduced above, we computed,
for each of the 10,000 most frequent tags of del.icio.us, its most
closely related tags. Tables 1-3 show a few selected examples. We
observe that in many cases the cosine similarity provides more synonyms
than the other measures. For instance, for the tag web2.0 it
returns some of its other commonly used spellings (see footnote 5).
For the tag games, the cosine similarity also provides tags that one
could consider as semantically similar (like the singular form game
or its German and French translations spiel and jeu), while the other
two measures provide related tags like fun or software. The same
observation is also made for the "functional" tag tobuy (see [9]),
where the cosine similarity provides tags with equivalent functional
value, whereas the other measures rather provide categories of items
one could buy. It is also interesting that the cosine similarity relates
java to python (Table 2), which could be considered its sibling in
some suitable concept hierarchy. A possible justification
for these different behaviors is that the cosine measure is measuring
the frequency of co-occurrence with other words in the global contexts,
whereas the co-occurrence measure and, to a lesser extent,
FolkRank measure the frequency of co-occurrence with other words
in the same posts. We will substantiate this assumption later in the
paper on a more general level.

5 The tag "web at the fourth position is likely to stem from some user who
typed in "web 2.0", which in the earlier del.icio.us was interpreted as two
separate tags, "web and 2.0".

Table 1. Examples of most related tags measured by co-occurrence

rank  tag         1            2            3            4            5
13    web2.0      ajax         web          tools        blog         webdesign
15    howto       tutorial     reference    tips         linux        programming
28    games       fun          flash        game         free         software
30    java        programming  development  opensource   software     web
39    opensource  software     linux        programming  tools        free
1152  tobuy       shopping     books        book         design       toread

Table 2. Examples of most related tags measured by cosine similarity

rank  tag         1            2            3            4            5
13    web2.0      web2         web-2.0      webapp       "web         web_2.0
15    howto       how-to       guide        tutorials    help         how_to
28    games       game         timewaster   spiel        jeu          bored
30    java        python       perl         code         C++          delphi
39    opensource  open_source  open-source  open.source  OSS          foss
1152  tobuy       wishlist     to_buy       buyme        wish-list    iwant

Table 3. Examples of most related tags measured by FolkRank

rank  tag         1            2            3            4            5
13    web2.0      web          ajax         tools        design       blog
15    howto       reference    linux        tutorial     programming  software
28    games       game         fun          flash        software     programming
30    java        programming  development  software     ajax         web
39    opensource  software     linux        programming  tools        web
1152  tobuy       toread       shopping     design       books        music



Table 4. Overlap between the ten most closely related tags.

freq-folkrank  cosine-freq  cosine-folkrank
6.7            1.7          1.1



The first natural aspect to investigate is whether the most closely
related tags are shared across relatedness measures. We consider the
10,000 most popular tags in del.icio.us, and for each of them we
compute the 10 most related tags according to each of the relatedness
measures. Table 4 reports the average number of shared tags for
the three relatedness measures. We observe that relatedness by co-occurrence
(freq) and by FolkRank share a large fraction of the 10
most closely related tags, while the cosine relatedness displays little
overlap with both of them. To better investigate this point, we plot in




Figure 1. Average rank of the related tags as a function of the rank of the
original tag (horizontal axis: tag rank).



Figure 1 the average rank (according to global frequency) of the 10
most closely related tags as a function of the rank of the original tag.
The average rank of the tags obtained by co-occurrence relatedness
(black) and by FolkRank (green) is low and increases slowly with the
rank of the original tag: this points out that most of the related tags
are among the high-frequency tags, independently of the original tag.
On the contrary, the cosine relatedness (red curve) displays a different
behavior: the rank of related tags increases much faster with that
of the original tag. That is, the tags obtained from cosine-similarity
relatedness belong to a broader class of tags, not strongly correlated
with rank (frequency) (see footnote 6).

6 Notice that the curve for the cosine-similarity relatedness (red) approaches
a value of 5,000 for high ranks: this is the value one would expect if tag
relatedness were independent of tag rank.
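The overlap statistic reported in Table 4 amounts to intersecting, tag by tag, the top-10 lists produced by two measures and averaging the sizes of the intersections. A minimal sketch follows; the function name is an assumption, and the example lists are the first three related tags for games and java from Tables 1 and 2, so `k = 3` is used instead of 10.

```python
def average_overlap(related_a, related_b, k=10):
    """Average number of tags shared by the top-k lists of two relatedness measures."""
    common = related_a.keys() & related_b.keys()
    if not common:
        return float("nan")
    shared = sum(len(set(related_a[t][:k]) & set(related_b[t][:k])) for t in common)
    return shared / len(common)

# First three related tags from Table 1 (co-occurrence) and Table 2 (cosine).
freq = {"games": ["fun", "flash", "game"],
        "java": ["programming", "development", "opensource"]}
cosine = {"games": ["game", "timewaster", "spiel"],
          "java": ["python", "perl", "code"]}

overlap = average_overlap(freq, cosine, k=3)
```

On these two rows, only game is shared (for games), giving an average overlap of 0.5 out of 3.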



6 Semantic Grounding 

In this section we shift perspective and move from the qualitative 
discussion of Section 5 to a more formal validation. Our strategy is
to ground the relations between the original and the related tags by 
looking up the tags in a formal representation of word meanings. As 
structured representations afford the definition of well-defined met- 
rics of semantic similarity, one can investigate the type of semantic 
relations that hold between the original tags and their related tags 
(obtained by using any of the relatedness measures we study). 

In the following we ground our measures of tag relatedness by 
using WordNet [6], a semantic lexicon of the English language. In
WordNet words are grouped into synsets, sets of synonyms that rep- 
resent one concept. Synsets are nodes in a network and links between 
synsets represent semantic relations. 

For nouns and verbs it is possible to restrict the links in the net- 
work to (directed) is-a relationships only, so that a subsumption hi- 
erarchy can be defined. The is-a relation connects a hyponym (more 
specific synset) to a hypernym (more general synset). Since the is-a 
WordNet network for nouns and verbs consists of several discon- 
nected hierarchies, it is useful to add a (fake) global root node sub- 
suming all the roots of those hierarchies, making the graph fully con- 
nected and allowing the definition of several graph-based similarity 
metrics between pairs of nouns and pairs of verbs. We will use such 
measures to ground our tag-based measures of relatedness in folk- 
sonomies. 
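The effect of the artificial root can be illustrated on a toy is-a hierarchy. The sketch below uses an invented set of concepts (not actual WordNet synsets), links every node without a hypernym to a root node named `ROOT`, and computes the shortest-path length by breadth-first search over the resulting undirected graph.

```python
from collections import deque

def build_graph(hypernyms, root="ROOT"):
    """Undirected is-a graph; nodes without a hypernym are attached to a fake root."""
    nodes = set(hypernyms) | {p for parents in hypernyms.values() for p in parents}
    graph = {n: set() for n in nodes}
    graph[root] = set()
    for n in nodes:
        for parent in hypernyms.get(n) or [root]:
            graph[n].add(parent)
            graph[parent].add(n)
    return graph

def path_length(graph, a, b):
    """Number of edges on a shortest path from a to b (breadth-first search)."""
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # not reachable

# Hypothetical hierarchy with two separate roots, "entity" and "act".
hypernyms = {"entity": [], "activity": ["entity"], "game": ["activity"],
             "sport": ["activity"], "act": [], "play": ["act"]}
graph = build_graph(hypernyms)
```

Without the fake root, game and play would lie in different hierarchies and have no path between them; with it, every pair of concepts has a finite distance.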

We measure the similarity in WordNet using both the taxonomic
shortest-path length and a distance measure introduced by Jiang
and Conrath [14] that combines the taxonomic path length with
an information-theoretic similarity measure by Resnik [22]. We
use the implementation of those measures available in the WordNet::Similarity
library [21]. We remark that [2] provides a pragmatic
grounding of the Jiang-Conrath measure by means of user studies
and by its superior performance in the correction of spelling errors.
This way, our semantic grounding in WordNet of the folksonomy
similarity measures is extended to a pragmatic grounding in the experiments
of [2].
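For orientation, the Jiang-Conrath distance is commonly stated in terms of the information content of two concepts and of their lowest common subsumer in the is-a hierarchy. The formulation below is the usual textbook form; the symbols $\mathrm{IC}$, $p(c)$ and $\mathrm{lcs}$ are notation introduced here, and the corpus statistics behind $p(c)$ in the experiments are not specified in this text.

```latex
% Information content of a concept c, with p(c) the probability of
% encountering an instance of c in a reference corpus:
\mathrm{IC}(c) = -\log p(c)

% Resnik similarity: information content of the lowest common subsumer
\mathrm{sim}_{\mathrm{Resnik}}(c_1, c_2) = \mathrm{IC}\bigl(\mathrm{lcs}(c_1, c_2)\bigr)

% Jiang-Conrath distance
d_{\mathrm{JC}}(c_1, c_2) = \mathrm{IC}(c_1) + \mathrm{IC}(c_2)
    - 2\,\mathrm{IC}\bigl(\mathrm{lcs}(c_1, c_2)\bigr)
```

The distance is zero when both concepts coincide and grows as their common subsumer becomes more general, i. e., less informative.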

The program outlined above is only viable if a significant fraction 
of the popular tags in del.icio.us is also present in WordNet.
Several factors limit the WordNet coverage of del.icio.us tags: Word- 
Net only covers the English language and contains a static body of 
words, while del.icio.us contains tags from different languages and is 
an open-ended system. This is not a big problem in practice because, 
to date, the vast majority of del.icio.us tags are grounded in the En- 
glish language. Another limiting factor is the structure of WordNet 
itself, where the measures described above can only be implemented 
for nouns and verbs, separately. Many tags are actually adjectives [9],
and although their grounding is possible, no distance based on the
subsumption hierarchy can be computed in the adjective partition 
of WordNet. Nevertheless, the nominal form of the adjective is of- 
ten covered by the noun partition. Despite this, if we consider the 
popular tags in del.icio.us, a significant fraction of them is actually 
covered by WordNet: roughly 61% of the 10,000 most frequent tags
in del.icio.us can be found in WordNet. In the following, to make 
contact with the previous sections, we will focus on these tags. 

Table 5. Average semantic distance, measured in WordNet, from the original
tag to the most closely related one.

similarity metric  freq  folkrank  cosine
shortest path      7.4   7.8       6.3
Jiang-Conrath      13.1  13.6      10.8



A first assessment of the measures of relatedness can be carried 
out by measuring - in WordNet - the average semantic distance be- 
tween a tag and the corresponding most closely related tag according 
to each one of the relatedness measures we consider. Given a mea- 
sure of relatedness, we loop over the tags that are both in del.icio.us 
and WordNet, and for each of those tags we use the chosen measure 
of relatedness to find the corresponding most related tag. If the most 
related tag is also in WordNet, we measure the semantic distance between
the synsets that contain the original tag and the most closely related
tag, respectively. In the case of the shortest-path distance, if any
of the tags occurs in more than one synset, we select the synsets that
minimize the path length. Table 5 reports the average semantic distance,
computed in WordNet by using both the (edge) shortest-path
length and the Jiang-Conrath distance. The cosine relatedness points
to tags that are semantically closer according to both measures. We
remark once more that the Jiang-Conrath measure has been validated
in user studies [2], so that Table 5 actually deals with distances cognitively
perceived by human subjects. The closer semantic proximity
of tags obtained by cosine relatedness was intuitively apparent from
the comparison of Table 2 with Table 1 and Table 3, but now we are
able to ground this statement through user-validated measures based
on the subsumption hierarchy of WordNet.
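The averaging procedure behind Table 5 can be summarized in a few lines. The sketch below assumes three inputs that are hypothetical stand-ins for the real data: a mapping from each tag to its most related tag under one measure, a mapping from tags to the identifiers of the synsets containing them, and a distance function between synsets. The synset identifiers and distances are invented for the example.

```python
def average_semantic_distance(most_related, synsets, dist):
    """Average distance from each grounded tag to its most related tag.

    Pairs in which either tag has no synset are skipped; for tags with several
    synsets, the pair of synsets with the minimal distance is used.
    """
    total, count = 0.0, 0
    for tag, related in most_related.items():
        if tag not in synsets or related not in synsets:
            continue  # the pair cannot be grounded
        total += min(dist(a, b) for a in synsets[tag] for b in synsets[related])
        count += 1
    return total / count if count else float("nan")

# Hypothetical synset identifiers and pairwise distances.
synsets = {"games": ["s_game_1"], "game": ["s_game_1", "s_game_2"],
           "java": ["s_java_island", "s_java_language"], "python": ["s_python_language"]}
table = {frozenset(["s_game_1"]): 0,
         frozenset(["s_game_1", "s_game_2"]): 3,
         frozenset(["s_java_island", "s_python_language"]): 9,
         frozenset(["s_java_language", "s_python_language"]): 2}

def toy_dist(a, b):
    return table[frozenset([a, b])]

most_related = {"games": "game", "java": "python", "web2.0": "web2"}
avg = average_semantic_distance(most_related, synsets, toy_dist)
```

Here the pair (web2.0, web2) is skipped because neither tag is grounded, games and game share a synset (distance 0), and java is matched to python through its closest synset (distance 2), giving an average of 1.0.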

As noted in Section 5, the tags obtained via the cosine-similarity
relatedness measure appear to be "synonyms" or "siblings" of the 
original tag, while the two other measures of relatedness seem to 
provide "more general" tags. The possibility of locating tags in the 
WordNet hierarchy allows us to be more precise about the nature of 
these relations. In the rest of this section we will focus on the short- 
est paths in WordNet that lead from an initial tag to its most closely 
related tag (according to the different similarity measures), and char- 
acterize the length and edge composition (hypernym/hyponym) of 
such paths. 

Table 6. Probabilities of the lengths of the shortest path leading from the
original tag to the most closely related one. Path lengths are computed using
the subsumption hierarchy in WordNet.

          shortest path length
          0     1     2     ≥ 3
freq      0.05  0.04  0.06  0.85
folkrank  0.04  0.04  0.05  0.87
cosine    0.18  0.03  0.09  0.70



Table 6 summarizes the probabilities of the shortest-path lengths
n (number of edges) connecting a tag to its closest related tag in
WordNet. The FolkRank and co-occurrence relatedness have similar
probabilities. The cosine relatedness displays higher values at
n = 0 and n = 2 and a comparatively depleted number of paths
with n = 1. The higher value at n = 0 is due to the detection of
actual synonyms; i. e., the cosine relatedness, in about 18% of the
cases, points to a tag which belongs to the same synset as the original
tag. The smaller number of paths with n = 1 (one single edge
in WordNet) is consistent with the idea that the cosine relatedness
favors siblings/synonymous tags: moving by a single edge, instead,
leads to either a hypernym or a hyponym in the WordNet hierarchy,
never to a sibling. The higher value at n = 2 (paths with two edges
in WordNet) may be compatible with the sibling relation, but in order
to ascertain it we have to characterize the average edge composition
of these paths.

Figure 2 displays the average edge type composition (hypernym/hyponym
edges) for paths of length 1 and 2. For the cosine-similarity
relatedness (blue), we observe that the paths with n = 2
(right-hand side of Figure 2) consist almost entirely (90%) of one
hypernym edge (up) and one hyponym edge (down), i. e., these paths
do lead to siblings. Notice how the path composition is very different
for the other relatedness measures: in those cases roughly half of the
paths consist of two hypernym edges in the WordNet hierarchy. We
observe a similar behavior for n = 1, where the cosine relatedness
has no statistically preferred direction, while the other measures of
relatedness point preferentially to hypernyms.

Figure 2. Edge composition of the shortest paths of length 1 (left) and 2
(right). An "up" edge leads to a hypernym, while a "down" edge leads to a
hyponym.
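The edge composition of a path can be obtained by labelling each step as "up" (toward a hypernym) or "down" (toward a hyponym) and counting the resulting label sequences. The following sketch assumes that paths are given as lists of concepts and that the hierarchy is a mapping from each concept to its direct hypernyms; the small hierarchy used here is invented for illustration.

```python
from collections import Counter

def edge_composition(path, hypernyms):
    """Label each edge of a path as 'up' (to a hypernym) or 'down' (to a hyponym)."""
    labels = []
    for a, b in zip(path, path[1:]):
        if b in hypernyms.get(a, ()):
            labels.append("up")
        elif a in hypernyms.get(b, ()):
            labels.append("down")
        else:
            raise ValueError(f"{a!r} and {b!r} are not linked by an is-a edge")
    return tuple(labels)

# Hypothetical is-a hierarchy: concept -> direct hypernyms.
hypernyms = {"java": ["language"], "python": ["language"], "language": ["tool"]}

paths = [
    ["java", "language", "python"],  # up, down: python is a sibling of java
    ["java", "language", "tool"],    # up, up: tool is a more general concept
    ["java", "language"],            # up: direct hypernym
]
counts = Counter(edge_composition(p, hypernyms) for p in paths)
```

A path labelled ("up", "down") ends at a sibling of the starting concept, whereas ("up", "up") ends at a more general concept two levels higher, which is the distinction drawn in Figure 2.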

7 Discussion and Perspectives 

The main contribution of this paper is a methodological one. Sev- 
eral measures of relatedness have been proposed in the literature, but 
given the fluid and open-ended nature of social bookmarking sys- 
tems, it is hard to characterize - from the semantic point of view - 
what kind of relations they establish. As these relations constitute 
an important building block for extracting formalized knowledge, a 
deeper understanding of these measures is needed. Here we proposed 
to ground different measures of tag relatedness in a folksonomy by 
mapping del.icio.us tags, when possible, on WordNet synsets and 
using well-established measures of semantic distance in WordNet to 
gain insight into their respective characteristics. 

Our results can be taken as indicators that the choice of an appro- 
priate relatedness measure is able to yield valuable input for learn- 
ing semantic term relationships from folksonomies. We will close by 
briefly discussing which of the three relatedness measures we studied 
is best for . . . 



• ... synonym discovery. The cosine similarity is clearly the measure 
to choose when one would like to discover synonyms. As shown 
in this work, cosine similarity delivers not only spelling variants 
but also terms that belong to the same WordNet synset. 

• ... concept hierarchy. Both FolkRank and co-occurrence related- 
ness seemed to yield more general tags in our analyses. This is 
why we think that these measures provide valuable input for algo- 
rithms to extract taxonomic relationships between tags. 

• ... discovery of multi-word lexemes. Depending on the allowed tag
delimiters, it can happen that multi-word lexemes end up as several
tags. Our experiment indicates that FolkRank is best suited to discover
these cases. For the tag open, for instance, it is the only one of
the three measures that has source within the ten most related
tags, and vice versa (see footnote 7).

7 Open is at position 6 for source, and source is at position 3 for open.



Future work includes the analysis of further relatedness measures, 
e. g., based on representations in the vector spaces spanned by the 
users or resources. We are furthermore currently working on adapting 
existing ontology learning techniques to folksonomies, including the 
presented measures. 